maxesse
Contributor

@maxesse maxesse commented Aug 20, 2025

Closes #3603

This change excludes /contentstorage/ URLs from the sync in all API calls to SharePoint. The connector should not try to access these URLs: SharePoint creates them internally for Teams private channels, Loop components, etc. (what they're used for is largely undocumented, to be honest), and they have a different permission model that causes 401 errors and stops the connector from syncing.
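As an illustration only, a check of the kind described above could look like the sketch below. This is not the connector's actual code: the function name `is_contentstorage_url`, the constant, and the choice to match on whole path segments rather than on a substring are all assumptions, and `contoso` is a placeholder tenant.

```python
from urllib.parse import urlsplit

# Hypothetical constant: the PR only names the "/contentstorage/" path.
CONTENTSTORAGE_SEGMENT = "contentstorage"


def is_contentstorage_url(url: str) -> bool:
    """Return True when one path segment of `url` is exactly 'contentstorage'.

    The comparison is case-insensitive and works on whole segments, so a
    site named e.g. 'contentstorage-archive' is not treated as a match.
    """
    path = urlsplit(url).path
    segments = [segment.lower() for segment in path.split("/") if segment]
    return CONTENTSTORAGE_SEGMENT in segments
```

A caller would skip any URL for which this returns True before issuing a request to it. Matching on segments is a design choice made for this sketch; a plain substring test is simpler but could also match unrelated site names.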

Checklists

Pre-Review Checklist

  • this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check config.yml.example)
  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • this PR has a thorough description
  • Tested the changes locally
  • For bugfixes: backport safely to all minor branches still receiving patch releases

Release Note

Fixes an issue where a SharePoint Online sync configured to crawl the entire tenant (by selecting * in the site list) might stop with 401 errors when trying to access URLs containing /contentstorage/.

@maxesse maxesse requested a review from a team as a code owner August 20, 2025 09:13

cla-checker-service bot commented Aug 20, 2025

💚 CLA has been signed

@artem-shelkovnikov
Member

buildkite test this

@maxesse
Contributor Author

maxesse commented Aug 20, 2025

Ok I signed the contributor agreement too, hopefully the cla-checker-service will do another pass.

@artem-shelkovnikov
Member

The CLA checker is happy, but test coverage has now fallen below 92%. You can verify this by running make ftest.

I also see that you've added the check in a lot of places - is it needed?

Theoretically, if you just ignore the site itself, you won't need to propagate the change down to all entities, or is it not the case?

@maxesse maxesse force-pushed the fix/sharepoint-contentstorage-urls branch from 319658c to a433cff Compare October 6, 2025 14:59
@maxesse
Contributor Author

maxesse commented Oct 6, 2025

> The CLA checker is happy, but test coverage has now fallen below 92%. You can verify this by running make ftest.
>
> I also see that you've added the check in a lot of places - is it needed?
>
> Theoretically, if you just ignore the site itself, you won't need to propagate the change down to all entities, or is it not the case?

Hi Artem - sorry for the long silence, between holidays and lots of other stuff I didn't have time to fix this. You're right, it should be enough to keep the check in the core places. I updated it.

@artem-shelkovnikov
Member

buildkite test this

@artem-shelkovnikov
Member

buildkite test this

@artem-shelkovnikov
Member

Just to check, what is contentstorage? Is it a site, rather than a List or another entity?

@maxesse
Contributor Author

maxesse commented Oct 16, 2025

> Just to check, what is contentstorage? Is it a site, rather than a List or another entity?

Hello Artem, it's a special URL that SharePoint uses to manage content for private channels in Teams and for other entities like Loop. Even with sites.fullcontrol you won't be able to expand permissions or content for these URLs, and they'll cause the connector to fail. They have been mentioned a few times, but there's very little information available overall:

https://learn.microsoft.com/en-us/answers/questions/5219021/where-are-loop-workspace-pages-actually-stored

https://learn.microsoft.com/en-us/sharepoint/dev/embedded/overview

Because these containers are not “classic SharePoint sites/libraries” the way you might expect:

  1. No UI surface
    These containers don’t show up in the SharePoint Admin Center, site collections list, or site UI in the usual way. They’re “hidden.” 
  2. Permissions are app-scoped / container-scoped
    The owning application (or container type) has the model for how access is managed. Normal site or site-collection “Full Control” does not necessarily include the ability to touch these embedded containers. 
  3. Headless / API-only model
    Because this is intended for apps, Microsoft designed it so that you must use APIs (Graph, Embedded APIs) to manage containers, rather than the standard SharePoint GUI. 
  4. Isolation & compliance semantics
    Microsoft intends these containers to support compliance, governance, and storage partitioning in a more controlled way (they count toward tenant storage, etc.). That design means more access boundaries. 

Thus, even if you have full control on a related “site” or “workspace,” those permissions don’t automatically bridge into the embedded / contentstorage container.

@artem-shelkovnikov
Member

Thanks for the write-up!

For me it was important to check which API returns this entity, and it seems that the sites API returns the entity that we later try to crawl and fail on.

Then your code, which is now added only at the site level, would exclude them, and the /contentstorage/ path would not appear anywhere else.

I'm double-checking because we've had a similar experience with special Lists that SharePoint has: they are available via the lists API, but it's impossible to enumerate them, seemingly because their permission model is different.

@maxesse
Contributor Author

maxesse commented Oct 16, 2025

I would suggest that before we merge the updated change, I run a longer sync, say for a day, to be really sure it doesn't happen anywhere. The previous version with the repeated statements worked 100%, but now I'm starting to doubt whether some paths might still return /contentstorage.
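One way to check a long sync for leftovers is to scan its output for the path afterwards. The helper below is only a sketch under stated assumptions: `find_contentstorage_hits` is a hypothetical name, and it treats the log as plain text lines without assuming any particular connector log format.

```python
from typing import Iterable, List, Tuple


def find_contentstorage_hits(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines mentioning /contentstorage.

    Line numbers start at 1 and the match is case-insensitive.
    """
    needle = "/contentstorage"
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line.lower()
    ]


# Example usage with a hypothetical log file name:
#     with open("connector.log", encoding="utf-8") as handle:
#         for number, line in find_contentstorage_hits(handle):
#             print(number, line)
```

An empty result after a full sync would suggest that no request touched those paths, provided the log level is verbose enough to record the URLs being fetched.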

@artem-shelkovnikov
Member

We'd appreciate it if you tried a long sync with this code, yes!

@artem-shelkovnikov
Member

Alternatively, if you can provide the details about how to do the same setup on our side, we'll be happy to reproduce and test ourselves

@maxesse
Contributor Author

maxesse commented Oct 16, 2025

I've asked ChatGPT how to trigger their creation, so you might be able to try it on your end:

/contentstorage/... paths get created by apps that use SharePoint Embedded (Loop app, Copilot Pages/Notebooks, and any custom app built on Embedded). They’re not normal “sites,” so Teams private/shared channels won’t make them; those create regular channel sites. To force-create /contentstorage so your Elastic crawler can see them, the quickest repro is to create some Loop content (or a Copilot Page/Notebook). Details + sources below.

How to deliberately create /contentstorage/... URLs

Option A — Microsoft Loop (fastest, no code)

  1. Go to the Loop web app and create a new workspace.
  2. Add a new page, type a few lines, paste a table or a checklist; anything will do.
  3. That action provisions a hidden SharePoint Embedded container in your tenant whose URL pattern is: https://.sharepoint.com/contentstorage/CSP_/

You won’t see it in the SPO UI, but if you open browser dev tools while the page loads you’ll see network calls to that /contentstorage/... path. 

Even though the container is “headless,” its HTTP endpoints exist and are what Loop calls; that’s the same surface your site crawler stumbled onto in prod. There are public threads showing the access denied behavior on those URLs, which confirms they’re real but not user-surfaced. 

Option B — Copilot Pages / Copilot Notebooks

  1. Create a Copilot Page or Notebook (the successor family to some Loop scenarios).
  2. This also backs onto SharePoint Embedded containers and will mint /contentstorage/... paths. Admin docs and Purview pages call this out explicitly. 

These urls are app-owned, headless containers: permissions are governed by the owning app/Embedded, not standard site ACLs. You can hit the URLs, but expanding permissions via “Site permissions” won’t work, which matches what you’re seeing. Official docs and reputable write-ups confirm the storage model and the /contentstorage pattern for Loop. 

@artem-shelkovnikov
Member

That's a good writeup, thanks! I'll give it a test and get back to you

@maxesse
Contributor Author

maxesse commented Oct 16, 2025

I've now started a full sync with the code from this PR. I'll let it run for a good few hours and see how it goes. So far so good: 25k items in.

@artem-shelkovnikov
Member

In the meantime I was able to confirm it works on our tenant: I created a Copilot Page and it introduced a /contentstorage site object that caused my sync to fail.

I then checked out the changes from this PR and the sync ran successfully.

I think it's safe to merge this PR as is and your sync should finish well too. Since you've started one, we can delay the merge until you confirm that all works.

@maxesse
Contributor Author

maxesse commented Oct 16, 2025

That's awesome, Artem. The sync is still going at 150k items. By now it would have failed for sure: as I remember, it always failed at around 40k items, after hitting some site starting with A... that had Loop components etc. in it. I think we're good to go. It's also nice because it means I can stop maintaining my own fork of the connector!

@artem-shelkovnikov artem-shelkovnikov merged commit 6290525 into elastic:main Oct 16, 2025
2 checks passed
github-actions bot pushed a commit that referenced this pull request Oct 16, 2025
…3630)

## Closes #3603

The change excludes specific /contentstorage/ urls from the sync in all
API calls to Sharepoint. These URLs should not be attempted to be
accessed as they're created internally by Sharepoint for Teams private
channels, loop components, etc. (it's fairly undocumented what they're
used for to be honest), and have a different permission model that will
cause 401 errors and the connector to stop syncing.


## Checklists

#### Pre-Review Checklist
- [x] this PR does NOT contain credentials of any kind, such as API keys
or username/passwords (double check `config.yml.example`)
- [x] this PR has a meaningful title
- [x] this PR links to all relevant github issues that it fixes or
partially addresses
- [x] this PR has a thorough description
- [x] Tested the changes locally
- [x] For bugfixes: backport safely to all minor branches still
receiving patch releases

## Release Note

Fixes an issue where a Sharepoint Online sync configured to crawl the
entire tenant by selecting * in the site list, might stop with 401
errors when trying to access URLs containing /contentstorage/.

---------

Co-authored-by: Artem Shelkovnikov <[email protected]>

💚 Backport PR(s) successfully created

Status Branch Result
9.0 #3787
8.18 #3788
9.1 #3789
8.19 #3790
9.2 #3791

The backport PRs will be merged automatically after passing CI.

@artem-shelkovnikov
Member

Thank you for your contribution, @maxesse!

I've set everything up to backport your changes to the branches that we'll be releasing. They'll be available in Docker as soon as the release happens, or from source the moment we merge all the backports.

@maxesse
Contributor Author

maxesse commented Oct 16, 2025

That's awesome Artem, happy to help!

artem-shelkovnikov added a commit that referenced this pull request Oct 16, 2025
…ector (#3630) (#3788)

Backports the following commits to 8.18:
- fix: Exclude /contentstorage/ URLs from Sharepoint Online Connector
(#3630)

Co-authored-by: Max Sanna <[email protected]>
Co-authored-by: Artem Shelkovnikov <[email protected]>
artem-shelkovnikov added a commit that referenced this pull request Oct 16, 2025
…ector (#3630) (#3790)

Backports the following commits to 8.19:
- fix: Exclude /contentstorage/ URLs from Sharepoint Online Connector
(#3630)

Co-authored-by: Max Sanna <[email protected]>
Co-authored-by: Artem Shelkovnikov <[email protected]>
artem-shelkovnikov added a commit that referenced this pull request Oct 16, 2025
…ctor (#3630) (#3791)

Backports the following commits to 9.2:
- fix: Exclude /contentstorage/ URLs from Sharepoint Online Connector
(#3630)

---------

Co-authored-by: Max Sanna <[email protected]>
Co-authored-by: Artem Shelkovnikov <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>

Development

Successfully merging this pull request may close these issues.

Sharepoint Connector fails sync due to crawling /contentstorage/ URLs